Example CRD for katib operator #119
I think the operator pattern would fit very nicely into Katib.
/assign @jlewi |
ref #6 |
We have a MySQL backend DB, so I am not sure an operator would work for us. If changes to studies and trials are infrequent, we could try using the etcd in Kubernetes. If they change frequently, a custom apiserver may be better than a CRD operator, IMO. |
The problem I have with a DB backend is that it's another (hard!) component to maintain. MySQL on k8s is non-trivial at best; even with something like Vitess it's hard. ETCD, on the other hand, is always there. I don't see changes to a Trial happening more than a couple of times per training run, which even in the most sophisticated environments would grow to maybe hundreds a day. ETCD should handle that without breaking a sweat. |
This proposal is really two things
These seem to be addressing two different points, and it might be better to split them. For 1, what problem is a CRD solving, and what benefit does it provide? IIUC, the way Katib works is that you have two pieces of information
So right now if a user wants to launch an HP job
So it seems like the biggest hurdle right now is that users need to write a Go program to control their experiment. It looks like there is some logic that differs (see here) for each suggestion algorithm. Furthermore, I'm guessing there is work to do to add support for more advanced features like early stopping.
It seems like a CRD would offer a bit of syntactic sugar by letting people create a single resource rather than a deployment + config map. Of course, that wouldn't help anyone who wanted to customize the logic or add more advanced HP logic. Right now the code for the controller is pretty simple, so if we implement it as a CRD we should use metacontroller or some other pattern to keep the amount of code tractable and avoid the boilerplate we incur for other controllers like tf-operator. IMO, though, the big win is making the Go program more powerful so that it can cover more use cases, so I wouldn't prioritize creating a CRD ahead of making those improvements.
For 2: ideally, Katib shouldn't be overly prescriptive about the storage backend. Katib should define an interface that supports a variety of storage solutions. What is the current interface? Is it SQL, or does Katib define some interface around SQL? If it's SQL, should we leave it as SQL? Or should we wrap it in some higher-level API (e.g. a CRUD API) so that it's easier to add support for different backends? |
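To make the "higher-level CRUD API" question concrete, here is a minimal sketch of what such an interface could look like in Go. Everything in it is an assumption for illustration: `Trial`, `TrialStore`, `MemoryStore`, and the field names are hypothetical and are not Katib's actual types or schema. The in-memory implementation only stands in for a real backend (SQL, etcd, or anything else) that would satisfy the same interface.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// Trial is a hypothetical record; the fields are assumptions, not Katib's schema.
type Trial struct {
	ID      string
	StudyID string
	Status  string
}

// ErrNotFound is returned when a trial ID is unknown to the store.
var ErrNotFound = errors.New("trial not found")

// TrialStore is a hypothetical CRUD interface that a SQL, etcd, or other
// backend could implement without callers knowing which one is in use.
type TrialStore interface {
	CreateTrial(t Trial) error
	GetTrial(id string) (Trial, error)
	UpdateTrialStatus(id, status string) error
	DeleteTrial(id string) error
}

// MemoryStore is an in-memory TrialStore, standing in for a real backend.
type MemoryStore struct {
	mu     sync.Mutex
	trials map[string]Trial
}

// NewMemoryStore returns an empty MemoryStore.
func NewMemoryStore() *MemoryStore {
	return &MemoryStore{trials: make(map[string]Trial)}
}

// CreateTrial stores a new trial and rejects duplicate IDs.
func (m *MemoryStore) CreateTrial(t Trial) error {
	m.mu.Lock()
	defer m.mu.Unlock()
	if _, ok := m.trials[t.ID]; ok {
		return fmt.Errorf("trial %q already exists", t.ID)
	}
	m.trials[t.ID] = t
	return nil
}

// GetTrial returns the trial with the given ID or ErrNotFound.
func (m *MemoryStore) GetTrial(id string) (Trial, error) {
	m.mu.Lock()
	defer m.mu.Unlock()
	t, ok := m.trials[id]
	if !ok {
		return Trial{}, ErrNotFound
	}
	return t, nil
}

// UpdateTrialStatus overwrites the status of an existing trial.
func (m *MemoryStore) UpdateTrialStatus(id, status string) error {
	m.mu.Lock()
	defer m.mu.Unlock()
	t, ok := m.trials[id]
	if !ok {
		return ErrNotFound
	}
	t.Status = status
	m.trials[id] = t
	return nil
}

// DeleteTrial removes an existing trial.
func (m *MemoryStore) DeleteTrial(id string) error {
	m.mu.Lock()
	defer m.mu.Unlock()
	if _, ok := m.trials[id]; !ok {
		return ErrNotFound
	}
	delete(m.trials, id)
	return nil
}

func main() {
	// Callers depend only on the interface, not on the concrete backend.
	var store TrialStore = NewMemoryStore()
	_ = store.CreateTrial(Trial{ID: "t1", StudyID: "s1", Status: "Pending"})
	_ = store.UpdateTrialStatus("t1", "Running")
	t, _ := store.GetTrial("t1")
	fmt.Println(t.ID, t.Status)
}
```

With an interface of this shape, an ETCD-backed store and a SQL-backed store would be two implementations of the same contract, which is the trade-off being debated in this thread.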
Right. First point. Second point. |
I think, actually, that 2 is more important. The reason is that, after we're done with model training, we will want to test it or serve it. That could use some cooperation with other Kubeflow components like tf-serving.
I think at least 1 can be prescribed by us and set to ETCD; 2 shouldn't ever be. That being said, 1 can hold information about 2, including the path and type of storage. That would allow multiple components to act on the same model. |
@inc0 Is there a particular reason you are trying to persuade everyone that ETCD is the right storage? As discussed above, I don't think we want to be overly prescriptive about the storage backend. So if you feel strongly that ETCD is a good backend, why not just create an implementation of the Go interface for the DB that uses ETCD? |
It looks to me like we need to do a bunch of work to allow controller programs, e.g. git-issue-summarize-demo.go, to be reusable across models. Until programs are reusable, I'm not sure a CRD would help. Two items and examples that come to mind are
I think it would be really helpful to have a concrete example working E2E that would help illuminate what improvements (e.g. adding a CRD) would be useful. |
@jlewi re ETCD: the reason I think ETCD is the right storage is that everybody has it. Supporting relational databases is a hard problem, especially on k8s. Our mission is for Kubeflow to run on every k8s, and requiring a database will make that much harder. Without an external database or a (very hard to support) database on k8s, you won't be able to use Katib. Not to mention that if you deploy ModelDB you also get MongoDB to support, which is also not easy.
That's one of the reasons for a CRD, much like injecting a pod spec into a TFJob
As you've said, tf.Events integration. |
I think this is best addressed by #138 a design doc for Katib.
Ubiquity is not the only consideration. Lifecycle, scalability, etc. all matter. Storing it in the K8s master really doesn't strike me as the right solution for Cloud, since in the cloud K8s clusters can be ephemeral. That's why I'd like the storage backend to be pluggable, so that on Cloud we can swap in appropriate storage solutions. Most clouds offer a variety of managed storage options; I'm pretty sure every cloud has some managed SQL database. So on Cloud, using an external DB is a very good option. For OnPrem I'm not sure. That's why I think we should focus on getting the right interface and making the storage pluggable rather than pushing a single solution for everyone. |
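One way to picture "pluggable" here is a small registry from which a deployment picks its backend by name at startup. The sketch below is only an illustration under stated assumptions: `Backend`, `Factory`, `Register`, `Open`, and the `"memory"` backend are hypothetical names, not part of Katib, and the `dsn` argument is a placeholder for whatever connection settings a real backend would need.

```go
package main

import (
	"fmt"
	"sort"
)

// Backend is a hypothetical minimal storage interface.
type Backend interface {
	Put(key, value string) error
	Get(key string) (string, bool)
}

// Factory builds a Backend from a connection string (unused by "memory").
type Factory func(dsn string) (Backend, error)

// registry maps a backend name to the factory that constructs it.
var registry = map[string]Factory{}

// Register makes a backend available under the given name.
func Register(name string, f Factory) { registry[name] = f }

// Registered returns the sorted names of all registered backends.
func Registered() []string {
	names := make([]string, 0, len(registry))
	for n := range registry {
		names = append(names, n)
	}
	sort.Strings(names)
	return names
}

// Open constructs the named backend, or fails if it was never registered.
func Open(name, dsn string) (Backend, error) {
	f, ok := registry[name]
	if !ok {
		return nil, fmt.Errorf("unknown storage backend %q (registered: %v)", name, Registered())
	}
	return f(dsn)
}

// memBackend is a toy in-memory backend used only for this sketch.
type memBackend struct{ data map[string]string }

func (m *memBackend) Put(k, v string) error { m.data[k] = v; return nil }

func (m *memBackend) Get(k string) (string, bool) {
	v, ok := m.data[k]
	return v, ok
}

func init() {
	// A cloud deployment might register a managed-SQL backend here instead,
	// and an on-prem one might register something etcd-based.
	Register("memory", func(string) (Backend, error) {
		return &memBackend{data: map[string]string{}}, nil
	})
}

func main() {
	b, err := Open("memory", "")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	_ = b.Put("trial/t1/status", "Running")
	v, _ := b.Get("trial/t1/status")
	fmt.Println(v)

	// Asking for a backend that was never registered fails with a clear error.
	if _, err := Open("mysql", ""); err != nil {
		fmt.Println("error:", err)
	}
}
```

The maintenance concern raised in this thread still applies: each registered backend is code someone has to test and support, so the registry keeps the wiring cheap but not the backends themselves.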
Right now we'll have issues building Katib on OnPrem. Managed databases might work, but unless you have a managed database, you're screwed. We also deploy the database in ksonnet, which gives an illusion of stability. I think if we write too many backends, we'll dig a grave for ourselves. Pluggability at that level will be very hard to maintain. If anything, let's try MongoDB because it's used by ModelDB. I still think we can handle it with the ETCD under K8s, though. I don't think it's ephemeral... |
@YujiOshima has a StudyController CRD in #141. Closing this issue in favor of that. |
I think the operator pattern would fit very nicely into Katib. This change shows an example resource set that would be enough (I think) to create a full set of Katib-driven jobs.
This approach will also let us fully leverage tf-operator, and will help with reusing Katib-provided models and with model provenance. Operators will also have an easier time tracking and monitoring jobs.